ST405/ST645 Bayesian Data Analysis

Tutorial Questions (3)

Author

Prof. Niamh Cahill

Solutions

Question: Bayesian Regression Model - Fisheries Data

  1. Visualize the Snapper Length Data
ggplot(fish_dat, aes(x = snapper_length)) +
  geom_histogram(bins = 40, colour = "blue")


  2. Fit a Basic Bayesian Model

    • A Bayesian model that assumes a Normal likelihood for the length data, with an overall mean and variance (ignoring age groups for now).
## model spec
normmodel ="
model{
    for (i in 1:n){
        y.i[i] ~ dnorm(mu, sigma^-2)
    }
    }

## priors
mu ~ dnorm(0,100^-2)  # vague prior
sigma ~ dunif(0,30) # assumes the sd won't exceed 30

## data reps
for(i in 1:n){yrep[i] ~ dnorm(mu,sigma^-2)}
}
"

## data and parameters
jags.data <- list(y.i = fish_dat$snapper_length, 
                  n = nrow(fish_dat))

parnames <- c("mu","sigma","yrep")

## run JAGS
mod <- jags(data = jags.data, 
            parameters.to.save=parnames, 
            model.file = textConnection(normmodel),
            n.iter = 5000,
            n.burnin = 1000,
            n.thin = 2)
module glm loaded
Compiling model graph
   Resolving undeclared variables
   Allocating nodes
Graph information:
   Observed stochastic nodes: 256
   Unobserved stochastic nodes: 258
   Total graph size: 522

Initializing model
## output
m <- mod$BUGSoutput$sims.matrix

  3. Assess Model Convergence
plot(mod) # quick check for convergence
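Beyond the traceplots from plot(mod), it is worth scanning the numeric diagnostics that R2jags stores in mod$BUGSoutput$summary. A minimal sketch of that check, using a hypothetical stand-in matrix so it runs on its own (in practice the values come straight from the fitted model):

```r
## Stand-in for mod$BUGSoutput$summary (hypothetical values); in practice use
## summ <- mod$BUGSoutput$summary[, c("Rhat", "n.eff")]
summ <- rbind(mu    = c(Rhat = 1.001, n.eff = 5800),
              sigma = c(Rhat = 1.002, n.eff = 6000))

## Rhat close to 1 (conventionally below 1.1) and a large effective sample
## size suggest the chains have converged and mixed well
converged <- all(summ[, "Rhat"] < 1.1)
converged
```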

  4. Summarize Model Parameters
par_summary <- m %>% 
                gather_rvars(mu,sigma) %>% 
                median_qi(.value)
par_summary
# A tibble: 2 × 7
  .variable .value .lower .upper .width .point .interval
  <chr>      <dbl>  <dbl>  <dbl>  <dbl> <chr>  <chr>    
1 mu          6.10   5.87   6.34   0.95 median qi       
2 sigma       1.91   1.75   2.09   0.95 median qi       
  5. Conduct Posterior Predictive Checks

    (i) Compare Distributions

    • Plot the observed distribution of \(y\) and overlay it with the distributions of 50 simulated datasets (\(y_{\text{rep}}\)) from the model.
y <- fish_dat$snapper_length
yrep <- mod$BUGSoutput$sims.list$yrep 
ppc_dens_overlay(y, yrep[1:50, ]) 

This is OK, but not great. The mean of the yrep distributions is shifted to the right, and the model overdoes the lower tail, placing more density on the lower extremes than the observed data would suggest.

(ii) Evaluate Test Statistics

  • Calculate the empirical values for the following test statistics based on the observed data:
    • 1st percentile
    • Median
    • 97th percentile
  • Compare these empirical values with the corresponding posterior predictive distributions of test statistics.
## test statistics
ppc_stat(y,yrep, stat = "median") 
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The median is overestimated in the yrep datasets relative to the observed data.

q1 <- function(y) quantile(y, 0.01) # define a function for the 1st percentile (the ppc_stat help files give an example of this)
ppc_stat(y,yrep, stat = "q1")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The lower extremes of the yrep datasets are mostly lower than the observed lower extreme (based on the 1st percentile).

q97 <- function(y) quantile(y, 0.97) # same but for 97th percentile
ppc_stat(y,yrep, stat = "q97")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The upper extremes of the yrep datasets are mostly lower than the observed upper extreme.

(iii) Posterior Predictive \(p\)-values and Effect Sizes

# do median first
med_rep <- ppc_stat(y,yrep, stat = "median") # the data in this object contains medians for each yrep

## p-value (median)
sum(med_rep$data$value >= median(fish_dat$snapper_length))/length(med_rep$data$value)
[1] 0.9931667

There’s a 99.3% chance of seeing a median as or more extreme than what was observed, based on this model.

## effect size (median)
(median(fish_dat$snapper_length) - median(med_rep$data$value))/sd(med_rep$data$value)
[1] -2.508263

The effect size is large, reflecting what we saw in the posterior predictive p-value.

# 1st percentile 
perc_1_rep <- ppc_stat(y,yrep, stat = "q1") # the data in this object contains 1st percentiles for each yrep

## p-value (1st percentile)
1- sum(perc_1_rep$data$value >= quantile(fish_dat$snapper_length,probs = 0.01))/length(perc_1_rep$data$value)
[1] 0.997

There’s a 99.7% chance of seeing a 1st percentile as or more extreme than what was observed, based on this model. So the model has lower extremes beyond what is suggested by the observed data.

## effect size (1st percentile)
(quantile(fish_dat$snapper_length,probs = 0.01) - median(perc_1_rep$data$value))/sd(perc_1_rep$data$value)
      1% 
2.276166 

Large effect size again.

# 97th percentile 
perc_97_rep <- ppc_stat(y,yrep, stat = "q97") # the data in this object contains 97th percentiles for each yrep

## p-value (97th percentile)
sum(perc_97_rep$data$value >= quantile(fish_dat$snapper_length, probs = 0.97))/length(perc_97_rep$data$value)

Only a small proportion of the yrep 97th percentiles are as or more extreme than the observed value, so the model underestimates the upper extremes.

## effect size (97th percentile)
(quantile(fish_dat$snapper_length, probs = 0.97) - median(perc_97_rep$data$value))/sd(perc_97_rep$data$value)

  6. Fit a Group-Specific Bayesian Model
## Model spec
mixmodel ="
model{
    for (i in 1:n){
        y.i[i] ~ dnorm(mu.g[group[i]], sigma.g[group[i]]^-2)
    }

## priors
for(g in 1:ng)
{
mu.g[g] ~ dnorm(0,100^-2)
sigma.g[g] ~ dunif(0,30) 
}

for(i in 1:n){yrep[i] ~ dnorm(mu.g[group[i]],sigma.g[group[i]]^-2)}
}
"

## data and parameters
jags.data <- list(y.i = fish_dat$snapper_length, 
                  n = nrow(fish_dat),
                  group = fish_dat$group,
                  ng = 2)

parnames <- c("mu.g","sigma.g","yrep")

## run JAGS
mod <- jags(data = jags.data, 
             parameters.to.save=parnames, 
             model.file = textConnection(mixmodel),
             n.iter = 5000,
             n.burnin = 1000,
             n.thin = 2)
Compiling model graph
   Resolving undeclared variables
   Allocating nodes
Graph information:
   Observed stochastic nodes: 256
   Unobserved stochastic nodes: 260
   Total graph size: 782

Initializing model
## output
m <- mod$BUGSoutput$sims.matrix
plot(mod)

# (i)
y <- fish_dat$snapper_length
yrep <- mod$BUGSoutput$sims.list$yrep
ppc_dens_overlay(y, yrep[1:50, ])

This is a lot better than the first model.

# (ii)
## test statistics
ppc_stat(y,yrep, stat = "median") 
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

q1 <- function(y) quantile(y, 0.01) 
ppc_stat(y,yrep, stat = "q1") 
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

q97 <- function(y) quantile(y, 0.97) 
ppc_stat(y,yrep, stat = "q97") 
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

All of these have improved.

# (iii)
med_rep <- ppc_stat(y,yrep, stat = "median") # the data in this object contains medians for each yrep

## p-value
sum(med_rep$data$value >= median(fish_dat$snapper_length))/length(med_rep$data$value)
[1] 0.5486667

There’s about a 55% chance of seeing a median as or more extreme than what was observed, based on this model. This is an improvement on the first model.

## effect size

(median(fish_dat$snapper_length) - median(med_rep$data$value))/sd(med_rep$data$value)
[1] -0.11525

The effect size is much smaller, indicating the group-specific model replicates the observed median well.


Notes

  • The posterior predictive p-value indicates the proportion of simulated datasets that have test statistic values as extreme or more extreme than the value observed in the empirical data. It reflects the degree to which the model can replicate observed data characteristics.

  • The effect size quantifies the distance between the observed test statistic and the median of the posterior predictive distribution in standard deviation units. This gives an indication of how unusual the observed data is relative to the model predictions.
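The two quantities described above can be sketched in a few lines of base R. Here stat_rep stands in for the test statistic computed on each replicated dataset (e.g. med_rep$data$value in the solutions above) and stat_obs for the observed value; the numbers are hypothetical so the snippet runs on its own:

```r
set.seed(1)
stat_rep <- rnorm(6000, mean = 6.3, sd = 0.15)  # test statistic from each yrep (hypothetical values)
stat_obs <- 6.0                                 # test statistic from the observed data (hypothetical)

## posterior predictive p-value: proportion of replicated statistics
## as or more extreme than the observed value
pp_pval <- mean(stat_rep >= stat_obs)

## effect size: distance from the observed statistic to the centre of the
## posterior predictive distribution, in standard deviation units
effect_size <- (stat_obs - median(stat_rep)) / sd(stat_rep)

c(p_value = pp_pval, effect_size = effect_size)
```

With these stand-in numbers the observed statistic sits about two posterior predictive standard deviations below the replicate median, so the p-value is close to 1 and the effect size is large and negative, mirroring the pattern seen for the first model's median.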